{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bayesian Spell Checker ###"
   ]
  },
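  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re, collections\n",
    "\n",
    "def words(text):\n",
    "    # Lowercase the text and return every run of letters a-z.\n",
    "    return re.findall('[a-z]+', text.lower())\n",
    "\n",
    "def train(features):\n",
    "    # Count occurrences; the default of 1 keeps unseen words from getting a zero count.\n",
    "    model = collections.defaultdict(lambda: 1)\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    return model\n",
    "\n",
    "# big.txt is the training corpus; it must be in the working directory.\n",
    "with open('big.txt') as corpus:\n",
    "    NWORDS = train(words(corpus.read()))\n",
    "\n",
    "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n",
    "\n",
    "def edits1(word):\n",
    "    # Every string reachable from word by a single deletion, transposition, alteration, or insertion.\n",
    "    n = len(word)\n",
    "    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion\n",
    "               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition\n",
    "               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration\n",
    "               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion\n",
    "\n",
    "def known_edits2(word):\n",
    "    # Known words that are two single edits away from word.\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)\n",
    "\n",
    "def known(words):\n",
    "    # Keep only the candidates that appear in the corpus.\n",
    "    return set(w for w in words if w in NWORDS)\n",
    "\n",
    "def correct(word):\n",
    "    # Prefer the word itself if known, then known words one edit away, then two edits away.\n",
    "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n",
    "    # .get avoids inserting an unknown word into the defaultdict as a side effect of the lookup.\n",
    "    return max(candidates, key=lambda w: NWORDS.get(w, 1))"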
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'far'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Other misspellings to try: appl, appla, learw, tess, morw\n",
    "correct('fai')"
   ]
  },
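  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Goal: argmax_c P(c|w) -> argmax_c P(w|c) P(c) / P(w) ###\n",
    "\n",
    "Given a typed word w, we want the correction c that is most probable. By Bayes' rule, P(c|w) = P(w|c) P(c) / P(w). P(w) is the same for every candidate c, so it is enough to maximize P(w|c) P(c).\n",
    "\n",
    "* P(c): the probability that a correctly spelled word c appears in text, that is, how often c occurs in English writing.\n",
    "* P(w|c): the probability that the user types w when they meant c, that is, how likely c is to be mistyped as w.\n",
    "* argmax_c: enumerate all possible c and pick the one with the highest probability."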
   ]
  },
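  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rule above can be tried on a toy example. This is only a sketch: the counts and error probabilities are invented for illustration and do not come from `big.txt`, and `toy_counts`, `toy_error_model`, and `best_candidate` are hypothetical names. Because P(w) is the same for every candidate, the code compares P(w|c) P(c) only. With these made-up numbers, `far` wins because its much higher frequency outweighs its lower error probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of argmax_c P(w|c) * P(c). All numbers below are invented.\n",
    "toy_counts = {'far': 400, 'fair': 60, 'fail': 40}            # hypothetical corpus counts\n",
    "toy_error_model = {'far': 0.02, 'fair': 0.05, 'fail': 0.05}  # hypothetical P('fai' | c)\n",
    "\n",
    "def best_candidate(counts, error_model):\n",
    "    total = sum(counts.values())\n",
    "    # P(c) is the relative frequency of c; P(w) is the same for every c, so it is omitted.\n",
    "    return max(counts, key=lambda c: error_model[c] * counts[c] / total)\n",
    "\n",
    "best_candidate(toy_counts, toy_error_model)"
   ]
  },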
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract every word from the corpus: lowercase the text and keep only runs of letters a-z,\n",
    "# so punctuation and digits act as word separators.\n",
    "def words(text): return re.findall('[a-z]+', text.lower())\n",
    "\n",
    "def train(features):\n",
    "    model = collections.defaultdict(lambda: 1)\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    return model\n",
    "\n",
    "with open('big.txt') as corpus:\n",
    "    NWORDS = train(words(corpus.read()))\n"
   ]
  },
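  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens when we meet a word we have never seen? Suppose a word is spelled correctly but does not occur in the corpus, so it never appears in the training data. Its count would be 0, and the model would assign it probability 0. That is a problem, because probability 0 says the event can never happen, whereas we want unseen words to get a small but nonzero probability. This is the purpose of `collections.defaultdict(lambda: 1)` in `train`: every word starts from a default count of 1, so an unseen word is treated as if it had been seen once."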
   ]
  },
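  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of this smoothing behaviour, using a three-word stand-in corpus instead of `big.txt` (an assumption so the cell runs on its own; `demo_model` is a hypothetical name). It also shows that a membership test leaves the model unchanged, while indexing an unseen word inserts it with the default count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections\n",
    "\n",
    "# Stand-in corpus for illustration only; the notebook itself trains on big.txt.\n",
    "demo_model = collections.defaultdict(lambda: 1)\n",
    "for w in ['the', 'the', 'cat']:\n",
    "    demo_model[w] += 1\n",
    "\n",
    "print(demo_model['the'])    # seen twice: default 1 plus 2\n",
    "print(demo_model['cat'])    # seen once: default 1 plus 1\n",
    "print('dog' in demo_model)  # a membership test does not create an entry\n",
    "print(demo_model['dog'])    # an unseen word falls back to the default count\n",
    "print('dog' in demo_model)  # indexing has now inserted the key"
   ]
  },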
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defaultdict(<function __main__.train.<locals>.<lambda>()>,\n",
       "            {'the': 80031,\n",
       "             'project': 289,\n",
       "             'gutenberg': 264,\n",
       "             'ebook': 88,\n",
       "             'of': 40026,\n",
       "             'adventures': 18,\n",
       "             'sherlock': 102,\n",
       "             'holmes': 468,\n",
       "             'by': 6739,\n",
       "             'sir': 178,\n",
       "             'arthur': 35,\n",
       "             'conan': 5,\n",
       "             'doyle': 6,\n",
       "             'in': 22048,\n",
       "             'our': 1067,\n",
       "             'series': 129,\n",
       "             'copyright': 70,\n",
       "             'laws': 234,\n",
       "             'are': 3631,\n",
       "             'changing': 45,\n",
       "             'all': 4145,\n",
       "             'over': 1283,\n",
       "             'world': 363,\n",
       "             'be': 6156,\n",
       "             'sure': 124,\n",
       "             'to': 28767,\n",
       "             'check': 39,\n",
       "             'for': 6940,\n",
       "             'your': 1280,\n",
       "             'country': 424,\n",
       "             'before': 1364,\n",
       "             'downloading': 6,\n",
       "             'or': 5353,\n",
       "             'redistributing': 8,\n",
       "             'this': 4064,\n",
       "             'any': 1205,\n",
       "             'other': 1503,\n",
       "             'header': 8,\n",
       "             'should': 1298,\n",
       "             'first': 1178,\n",
       "             'thing': 304,\n",
       "             'seen': 445,\n",
       "             'when': 2924,\n",
       "             'viewing': 8,\n",
       "             'file': 22,\n",
       "             'please': 173,\n",
       "             'do': 1504,\n",
       "             'not': 6626,\n",
       "             'remove': 54,\n",
       "             'it': 10682,\n",
       "             'change': 151,\n",
       "             'edit': 5,\n",
       "             'without': 1016,\n",
       "             'written': 118,\n",
       "             'permission': 53,\n",
       "             'read': 219,\n",
       "             'legal': 53,\n",
       "             'small': 528,\n",
       "             'print': 48,\n",
       "             'and': 38313,\n",
       "             'information': 74,\n",
       "             'about': 1498,\n",
       "             'at': 6792,\n",
       "             'bottom': 43,\n",
       "             'included': 44,\n",
       "             'is': 9775,\n",
       "             'important': 286,\n",
       "             'specific': 38,\n",
       "             'rights': 169,\n",
       "             'restrictions': 24,\n",
       "             'how': 1316,\n",
       "             'may': 2552,\n",
       "             'used': 277,\n",
       "             'you': 5623,\n",
       "             'can': 1096,\n",
       "             'also': 779,\n",
       "             'find': 295,\n",
       "             'out': 1988,\n",
       "             'make': 505,\n",
       "             'a': 21156,\n",
       "             'donation': 11,\n",
       "             'get': 469,\n",
       "             'involved': 108,\n",
       "             'welcome': 19,\n",
       "             'free': 422,\n",
       "             'plain': 109,\n",
       "             'vanilla': 7,\n",
       "             'electronic': 59,\n",
       "             'texts': 8,\n",
       "             'ebooks': 55,\n",
       "             'readable': 14,\n",
       "             'both': 530,\n",
       "             'humans': 3,\n",
       "             'computers': 8,\n",
       "             'since': 261,\n",
       "             'these': 1232,\n",
       "             'were': 4290,\n",
       "             'prepared': 139,\n",
       "             'thousands': 94,\n",
       "             'volunteers': 23,\n",
       "             'title': 40,\n",
       "             'author': 30,\n",
       "             'release': 29,\n",
       "             'date': 49,\n",
       "             'march': 136,\n",
       "             'most': 909,\n",
       "             'recently': 31,\n",
       "             'updated': 5,\n",
       "             'november': 42,\n",
       "             'edition': 22,\n",
       "             'language': 62,\n",
       "             'english': 212,\n",
       "             'character': 175,\n",
       "             'set': 325,\n",
       "             'encoding': 6,\n",
       "             'ascii': 12,\n",
       "             'start': 68,\n",
       "             'additional': 31,\n",
       "             'editing': 7,\n",
       "             'jose': 2,\n",
       "             'menendez': 2,\n",
       "             'contents': 51,\n",
       "             'i': 7683,\n",
       "             'scandal': 20,\n",
       "             'bohemia': 16,\n",
       "             'ii': 78,\n",
       "             'red': 289,\n",
       "             'headed': 38,\n",
       "             'league': 54,\n",
       "             'iii': 92,\n",
       "             'case': 439,\n",
       "             'identity': 12,\n",
       "             'iv': 56,\n",
       "             'boscombe': 17,\n",
       "             'valley': 79,\n",
       "             'mystery': 40,\n",
       "             'v': 52,\n",
       "             'five': 280,\n",
       "             'orange': 24,\n",
       "             'pips': 13,\n",
       "             'vi': 38,\n",
       "             'man': 1653,\n",
       "             'with': 9741,\n",
       "             'twisted': 22,\n",
       "             'lip': 57,\n",
       "             'vii': 35,\n",
       "             'adventure': 35,\n",
       "             'blue': 144,\n",
       "             'carbuncle': 18,\n",
       "             'viii': 40,\n",
       "             'speckled': 6,\n",
       "             'band': 55,\n",
       "             'ix': 29,\n",
       "             'engineer': 13,\n",
       "             's': 5632,\n",
       "             'thumb': 52,\n",
       "             'x': 137,\n",
       "             'noble': 49,\n",
       "             'bachelor': 19,\n",
       "             'xi': 29,\n",
       "             'beryl': 5,\n",
       "             'coronet': 30,\n",
       "             'xii': 29,\n",
       "             'copper': 27,\n",
       "             'beeches': 13,\n",
       "             'she': 3947,\n",
       "             'always': 609,\n",
       "             'woman': 326,\n",
       "             'have': 3494,\n",
       "             'seldom': 77,\n",
       "             'heard': 637,\n",
       "             'him': 5231,\n",
       "             'mention': 47,\n",
       "             'her': 5285,\n",
       "             'under': 964,\n",
       "             'name': 263,\n",
       "             'his': 10035,\n",
       "             'eyes': 940,\n",
       "             'eclipses': 3,\n",
       "             'predominates': 4,\n",
       "             'whole': 745,\n",
       "             'sex': 12,\n",
       "             'was': 11411,\n",
       "             'that': 12513,\n",
       "             'he': 12402,\n",
       "             'felt': 698,\n",
       "             'emotion': 37,\n",
       "             'akin': 15,\n",
       "             'love': 485,\n",
       "             'irene': 19,\n",
       "             'adler': 17,\n",
       "             'emotions': 11,\n",
       "             'one': 3372,\n",
       "             'particularly': 175,\n",
       "             'abhorrent': 2,\n",
       "             'cold': 258,\n",
       "             'precise': 14,\n",
       "             'but': 5654,\n",
       "             'admirably': 8,\n",
       "             'balanced': 7,\n",
       "             'mind': 342,\n",
       "             'take': 617,\n",
       "             'perfect': 40,\n",
       "             'reasoning': 42,\n",
       "             'observing': 22,\n",
       "             'machine': 40,\n",
       "             'has': 1604,\n",
       "             'as': 8065,\n",
       "             'lover': 27,\n",
       "             'would': 1954,\n",
       "             'placed': 183,\n",
       "             'himself': 1159,\n",
       "             'false': 65,\n",
       "             'position': 433,\n",
       "             'never': 594,\n",
       "             'spoke': 219,\n",
       "             'softer': 11,\n",
       "             'passions': 30,\n",
       "             'save': 111,\n",
       "             'gibe': 3,\n",
       "             'sneer': 7,\n",
       "             'they': 3939,\n",
       "             'admirable': 15,\n",
       "             'things': 322,\n",
       "             'observer': 14,\n",
       "             'excellent': 63,\n",
       "             'drawing': 241,\n",
       "             'veil': 17,\n",
       "             'from': 5710,\n",
       "             'men': 1146,\n",
       "             'motives': 15,\n",
       "             'actions': 78,\n",
       "             'trained': 24,\n",
       "             'reasoner': 7,\n",
       "             'admit': 66,\n",
       "             'such': 1437,\n",
       "             'intrusions': 2,\n",
       "             'into': 2125,\n",
       "             'own': 786,\n",
       "             'delicate': 55,\n",
       "             'finely': 12,\n",
       "             'adjusted': 17,\n",
       "             'temperament': 6,\n",
       "             'introduce': 24,\n",
       "             'distracting': 2,\n",
       "             'factor': 42,\n",
       "             'which': 4843,\n",
       "             'might': 537,\n",
       "             'throw': 49,\n",
       "             'doubt': 153,\n",
       "             'upon': 1112,\n",
       "             'mental': 38,\n",
       "             'results': 230,\n",
       "             'grit': 2,\n",
       "             'sensitive': 36,\n",
       "             'instrument': 36,\n",
       "             'crack': 21,\n",
       "             'high': 291,\n",
       "             'power': 549,\n",
       "             'lenses': 2,\n",
       "             'more': 1998,\n",
       "             'disturbing': 10,\n",
       "             'than': 1207,\n",
       "             'strong': 169,\n",
       "             'nature': 171,\n",
       "             'yet': 489,\n",
       "             'there': 2973,\n",
       "             'late': 166,\n",
       "             'dubious': 2,\n",
       "             'questionable': 4,\n",
       "             'memory': 56,\n",
       "             'had': 7384,\n",
       "             'little': 1002,\n",
       "             'lately': 23,\n",
       "             'my': 2250,\n",
       "             'marriage': 97,\n",
       "             'drifted': 6,\n",
       "             'us': 685,\n",
       "             'away': 839,\n",
       "             'each': 412,\n",
       "             'complete': 146,\n",
       "             'happiness': 144,\n",
       "             'home': 296,\n",
       "             'centred': 3,\n",
       "             'interests': 119,\n",
       "             'rise': 241,\n",
       "             'up': 2285,\n",
       "             'around': 272,\n",
       "             'who': 3051,\n",
       "             'finds': 24,\n",
       "             'master': 142,\n",
       "             'establishment': 41,\n",
       "             'sufficient': 76,\n",
       "             'absorb': 5,\n",
       "             'attention': 192,\n",
       "             'while': 769,\n",
       "             'loathed': 2,\n",
       "             'every': 651,\n",
       "             'form': 508,\n",
       "             'society': 170,\n",
       "             'bohemian': 9,\n",
       "             'soul': 169,\n",
       "             'remained': 232,\n",
       "             'lodgings': 12,\n",
       "             'baker': 50,\n",
       "             'street': 181,\n",
       "             'buried': 22,\n",
       "             'among': 452,\n",
       "             'old': 1181,\n",
       "             'books': 60,\n",
       "             'alternating': 3,\n",
       "             'week': 96,\n",
       "             'between': 655,\n",
       "             'cocaine': 5,\n",
       "             'ambition': 14,\n",
       "             'drowsiness': 5,\n",
       "             'drug': 22,\n",
       "             'fierce': 13,\n",
       "             'energy': 46,\n",
       "             'keen': 33,\n",
       "             'still': 923,\n",
       "             'ever': 275,\n",
       "             'deeply': 78,\n",
       "             'attracted': 37,\n",
       "             'study': 145,\n",
       "             'crime': 62,\n",
       "             'occupied': 117,\n",
       "             'immense': 78,\n",
       "             'faculties': 9,\n",
       "             'extraordinary': 75,\n",
       "             'powers': 150,\n",
       "             'observation': 40,\n",
       "             'following': 209,\n",
       "             'those': 1202,\n",
       "             'clues': 4,\n",
       "             'clearing': 30,\n",
       "             'mysteries': 10,\n",
       "             'been': 2600,\n",
       "             'abandoned': 73,\n",
       "             'hopeless': 18,\n",
       "             'official': 92,\n",
       "             'police': 95,\n",
       "             'time': 1530,\n",
       "             'some': 1537,\n",
       "             'vague': 40,\n",
       "             'account': 178,\n",
       "             'doings': 12,\n",
       "             'summons': 12,\n",
       "             'odessa': 4,\n",
       "             'trepoff': 2,\n",
       "             'murder': 31,\n",
       "             'singular': 37,\n",
       "             'tragedy': 10,\n",
       "             'atkinson': 2,\n",
       "             'brothers': 51,\n",
       "             'trincomalee': 2,\n",
       "             'finally': 157,\n",
       "             'mission': 35,\n",
       "             'accomplished': 40,\n",
       "             'so': 3018,\n",
       "             'delicately': 4,\n",
       "             'successfully': 26,\n",
       "             'reigning': 4,\n",
       "             'family': 211,\n",
       "             'holland': 13,\n",
       "             'beyond': 226,\n",
       "             'signs': 99,\n",
       "             'activity': 132,\n",
       "             'however': 431,\n",
       "             'merely': 190,\n",
       "             'shared': 26,\n",
       "             'readers': 12,\n",
       "             'daily': 45,\n",
       "             'press': 82,\n",
       "             'knew': 497,\n",
       "             'former': 178,\n",
       "             'friend': 284,\n",
       "             'companion': 82,\n",
       "             'night': 386,\n",
       "             'on': 6644,\n",
       "             'twentieth': 20,\n",
       "             'returning': 69,\n",
       "             'journey': 70,\n",
       "             'patient': 384,\n",
       "             'now': 1698,\n",
       "             'returned': 195,\n",
       "             'civil': 178,\n",
       "             'practice': 96,\n",
       "             'way': 860,\n",
       "             'led': 197,\n",
       "             'me': 1921,\n",
       "             'through': 816,\n",
       "             'passed': 368,\n",
       "             'well': 1199,\n",
       "             'remembered': 121,\n",
       "             'door': 499,\n",
       "             'must': 956,\n",
       "             'associated': 197,\n",
       "             'wooing': 3,\n",
       "             'dark': 182,\n",
       "             'incidents': 15,\n",
       "             'scarlet': 23,\n",
       "             'seized': 115,\n",
       "             'desire': 97,\n",
       "             'see': 1102,\n",
       "             'again': 867,\n",
       "             'know': 1049,\n",
       "             'employing': 8,\n",
       "             'rooms': 87,\n",
       "             'brilliantly': 6,\n",
       "             'lit': 75,\n",
       "             'even': 947,\n",
       "             'looked': 761,\n",
       "             'saw': 600,\n",
       "             'tall': 75,\n",
       "             'spare': 28,\n",
       "             'figure': 104,\n",
       "             'pass': 155,\n",
       "             'twice': 85,\n",
       "             'silhouette': 2,\n",
       "             'against': 661,\n",
       "             'blind': 24,\n",
       "             'pacing': 27,\n",
       "             'room': 961,\n",
       "             'swiftly': 39,\n",
       "             'eagerly': 40,\n",
       "             'head': 726,\n",
       "             'sunk': 28,\n",
       "             'chest': 82,\n",
       "             'hands': 456,\n",
       "             'clasped': 12,\n",
       "             'behind': 402,\n",
       "             'mood': 52,\n",
       "             'habit': 56,\n",
       "             'attitude': 73,\n",
       "             'manner': 136,\n",
       "             'told': 491,\n",
       "             'their': 2956,\n",
       "             'story': 134,\n",
       "             'work': 383,\n",
       "             'risen': 31,\n",
       "             'created': 63,\n",
       "             'dreams': 17,\n",
       "             'hot': 120,\n",
       "             'scent': 18,\n",
       "             'new': 1212,\n",
       "             'problem': 77,\n",
       "             'rang': 30,\n",
       "             'bell': 66,\n",
       "             'shown': 114,\n",
       "             'chamber': 36,\n",
       "             'formerly': 78,\n",
       "             'part': 705,\n",
       "             'effusive': 3,\n",
       "             'glad': 151,\n",
       "             'think': 558,\n",
       "             'hardly': 174,\n",
       "             'word': 299,\n",
       "             'spoken': 93,\n",
       "             'kindly': 87,\n",
       "             'eye': 111,\n",
       "             'waved': 30,\n",
       "             'an': 3424,\n",
       "             'armchair': 50,\n",
       "             'threw': 97,\n",
       "             'across': 223,\n",
       "             'cigars': 8,\n",
       "             'indicated': 89,\n",
       "             'spirit': 168,\n",
       "             'gasogene': 2,\n",
       "             'corner': 129,\n",
       "             'then': 1559,\n",
       "             'stood': 384,\n",
       "             'fire': 275,\n",
       "             'introspective': 4,\n",
       "             'fashion': 50,\n",
       "             'wedlock': 2,\n",
       "             'suits': 9,\n",
       "             'remarked': 170,\n",
       "             'watson': 84,\n",
       "             'put': 436,\n",
       "             'seven': 133,\n",
       "             'half': 319,\n",
       "             'pounds': 27,\n",
       "             'answered': 227,\n",
       "             'indeed': 140,\n",
       "             'thought': 903,\n",
       "             'just': 768,\n",
       "             'trifle': 12,\n",
       "             'fancy': 51,\n",
       "             'observe': 38,\n",
       "             'did': 1876,\n",
       "             'tell': 493,\n",
       "             'intended': 59,\n",
       "             'go': 906,\n",
       "             'harness': 28,\n",
       "             'deduce': 15,\n",
       "             'getting': 93,\n",
       "             'yourself': 163,\n",
       "             'very': 1341,\n",
       "             'wet': 61,\n",
       "             'clumsy': 9,\n",
       "             'careless': 15,\n",
       "             'servant': 47,\n",
       "             'girl': 167,\n",
       "             'dear': 450,\n",
       "             'said': 3465,\n",
       "             'too': 549,\n",
       "             'much': 672,\n",
       "             'certainly': 120,\n",
       "             'burned': 78,\n",
       "             'lived': 114,\n",
       "             'few': 459,\n",
       "             'centuries': 13,\n",
       "             'ago': 109,\n",
       "             'true': 206,\n",
       "             'walk': 76,\n",
       "             'thursday': 8,\n",
       "             'came': 980,\n",
       "             'dreadful': 69,\n",
       "             'mess': 11,\n",
       "             'changed': 135,\n",
       "             'clothes': 63,\n",
       "             't': 1319,\n",
       "             'imagine': 97,\n",
       "             'mary': 706,\n",
       "             'jane': 3,\n",
       "             'incorrigible': 3,\n",
       "             'wife': 368,\n",
       "             'given': 365,\n",
       "             'notice': 99,\n",
       "             'fail': 41,\n",
       "             'chuckled': 8,\n",
       "             'rubbed': 33,\n",
       "             'long': 992,\n",
       "             'nervous': 55,\n",
       "             'together': 261,\n",
       "             'simplicity': 31,\n",
       "             'itself': 274,\n",
       "             'inside': 44,\n",
       "             'left': 835,\n",
       "             'shoe': 12,\n",
       "             'where': 978,\n",
       "             'firelight': 3,\n",
       "             'strikes': 20,\n",
       "             'leather': 36,\n",
       "             'scored': 5,\n",
       "             'six': 177,\n",
       "             'almost': 326,\n",
       "             'parallel': 18,\n",
       "             'cuts': 6,\n",
       "             'obviously': 39,\n",
       "             'caused': 103,\n",
       "             'someone': 161,\n",
       "             'carelessly': 15,\n",
       "             'scraped': 22,\n",
       "             'round': 557,\n",
       "             'edges': 71,\n",
       "             'sole': 71,\n",
       "             'order': 405,\n",
       "             'crusted': 3,\n",
       "             'mud': 37,\n",
       "             'hence': 33,\n",
       "             'double': 50,\n",
       "             'deduction': 13,\n",
       "             'vile': 17,\n",
       "             'weather': 43,\n",
       "             'malignant': 89,\n",
       "             'boot': 23,\n",
       "             'slitting': 3,\n",
       "             'specimen': 15,\n",
       "             'london': 77,\n",
       "             'slavey': 2,\n",
       "             'if': 2373,\n",
       "             'gentleman': 100,\n",
       "             'walks': 11,\n",
       "             'smelling': 6,\n",
       "             'iodoform': 44,\n",
       "             'black': 236,\n",
       "             'mark': 39,\n",
       "             'nitrate': 8,\n",
       "             'silver': 129,\n",
       "             'right': 711,\n",
       "             'forefinger': 8,\n",
       "             'bulge': 3,\n",
       "             'side': 512,\n",
       "             'top': 43,\n",
       "             'hat': 106,\n",
       "             'show': 214,\n",
       "             'secreted': 3,\n",
       "             'stethoscope': 3,\n",
       "             'dull': 75,\n",
       "             'pronounce': 10,\n",
       "             'active': 97,\n",
       "             'member': 51,\n",
       "             'medical': 23,\n",
       "             'profession': 23,\n",
       "             'could': 1701,\n",
       "             'help': 231,\n",
       "             'laughing': 116,\n",
       "             'ease': 45,\n",
       "             'explained': 61,\n",
       "             'process': 220,\n",
       "             'hear': 184,\n",
       "             'give': 524,\n",
       "             'reasons': 65,\n",
       "             'appears': 109,\n",
       "             'ridiculously': 2,\n",
       "             'simple': 140,\n",
       "             'easily': 115,\n",
       "             'myself': 228,\n",
       "             'though': 651,\n",
       "             'successive': 18,\n",
       "             'instance': 51,\n",
       "             'am': 747,\n",
       "             'baffled': 9,\n",
       "             'until': 326,\n",
       "             'explain': 124,\n",
       "             'believe': 184,\n",
       "             'good': 745,\n",
       "             'yours': 47,\n",
       "             'quite': 503,\n",
       "             'lighting': 17,\n",
       "             'cigarette': 7,\n",
       "             'throwing': 47,\n",
       "             'down': 1129,\n",
       "             'distinction': 20,\n",
       "             'clear': 234,\n",
       "             'example': 287,\n",
       "             'frequently': 219,\n",
       "             'steps': 189,\n",
       "             'lead': 138,\n",
       "             'hall': 84,\n",
       "             'often': 444,\n",
       "             'hundreds': 49,\n",
       "             'times': 237,\n",
       "             'many': 610,\n",
       "             'don': 582,\n",
       "             'observed': 132,\n",
       "             'point': 224,\n",
       "             'seventeen': 11,\n",
       "             'because': 631,\n",
       "             'interested': 66,\n",
       "             'problems': 79,\n",
       "             'enough': 176,\n",
       "             'chronicle': 8,\n",
       "             'two': 1139,\n",
       "             'trifling': 13,\n",
       "             'experiences': 12,\n",
       "             'sheet': 30,\n",
       "             'thick': 78,\n",
       "             'pink': 28,\n",
       "             'tinted': 10,\n",
       "             'notepaper': 3,\n",
       "             'lying': 119,\n",
       "             'open': 326,\n",
       "             'table': 297,\n",
       "             'last': 566,\n",
       "             'post': 118,\n",
       "             'aloud': 29,\n",
       "             'note': 116,\n",
       "             'undated': 2,\n",
       "             'either': 294,\n",
       "             'signature': 10,\n",
       "             'address': 77,\n",
       "             'will': 1578,\n",
       "             'call': 198,\n",
       "             'quarter': 47,\n",
       "             'eight': 129,\n",
       "             'o': 258,\n",
       "             'clock': 121,\n",
       "             'desires': 23,\n",
       "             'consult': 20,\n",
       "             'matter': 366,\n",
       "             'deepest': 16,\n",
       "             'moment': 488,\n",
       "             'recent': 55,\n",
       "             'services': 39,\n",
       "             'royal': 112,\n",
       "             'houses': 118,\n",
       "             'europe': 154,\n",
       "             'safely': 12,\n",
       "             'trusted': 17,\n",
       "             'matters': 137,\n",
       "             'importance': 118,\n",
       "             'exaggerated': 29,\n",
       "             'we': 1907,\n",
       "             'quarters': 73,\n",
       "             'received': 281,\n",
       "             'hour': 158,\n",
       "             'amiss': 7,\n",
       "             'visitor': 75,\n",
       "             'wear': 31,\n",
       "             'mask': 13,\n",
       "             'what': 3012,\n",
       "             'means': 254,\n",
       "             'no': 2349,\n",
       "             'data': 18,\n",
       "             'capital': 145,\n",
       "             'mistake': 40,\n",
       "             'theorise': 2,\n",
       "             'insensibly': 3,\n",
       "             'begins': 48,\n",
       "             'twist': 15,\n",
       "             'facts': 73,\n",
       "             'suit': 26,\n",
       "             'theories': 22,\n",
       "             'instead': 138,\n",
       "             'carefully': 73,\n",
       "             'examined': 50,\n",
       "             'writing': 70,\n",
       "             'paper': 178,\n",
       "             'wrote': 150,\n",
       "             'presumably': 9,\n",
       "             'endeavouring': 9,\n",
       "             'imitate': 8,\n",
       "             'processes': 36,\n",
       "             'bought': 56,\n",
       "             'crown': 62,\n",
       "             'packet': 12,\n",
       "             'peculiarly': 15,\n",
       "             'stiff': 21,\n",
       "             'peculiar': 85,\n",
       "             'hold': 115,\n",
       "             'light': 279,\n",
       "             'large': 484,\n",
       "             'e': 137,\n",
       "             'g': 56,\n",
       "             'p': 67,\n",
       "             'woven': 6,\n",
       "             'texture': 7,\n",
       "             'asked': 778,\n",
       "             'maker': 5,\n",
       "             'monogram': 5,\n",
       "             'rather': 220,\n",
       "             'stands': 20,\n",
       "             'gesellschaft': 2,\n",
       "             'german': 197,\n",
       "             'company': 193,\n",
       "             'customary': 20,\n",
       "             'contraction': 62,\n",
       "             'like': 1081,\n",
       "             'co': 31,\n",
       "             'course': 390,\n",
       "             'papier': 2,\n",
       "             'eg': 2,\n",
       "             'let': 507,\n",
       "             'glance': 92,\n",
       "             'continental': 47,\n",
       "             'gazetteer': 2,\n",
       "             'took': 574,\n",
       "             'heavy': 140,\n",
       "             'brown': 72,\n",
       "             'volume': 31,\n",
       "             'shelves': 4,\n",
       "             'eglow': 2,\n",
       "             'eglonitz': 2,\n",
       "             'here': 692,\n",
       "             'egria': 2,\n",
       "             'speaking': 186,\n",
       "             'far': 409,\n",
       "             'carlsbad': 2,\n",
       "             'remarkable': 78,\n",
       "             'being': 919,\n",
       "             'scene': 50,\n",
       "             'death': 331,\n",
       "             'wallenstein': 2,\n",
       "             'its': 1636,\n",
       "             'numerous': 51,\n",
       "             'glass': 117,\n",
       "             'factories': 30,\n",
       "             'mills': 40,\n",
       "             'ha': 76,\n",
       "             'boy': 170,\n",
       "             'sparkled': 6,\n",
       "             'sent': 320,\n",
       "             'great': 793,\n",
       "             'triumphant': 17,\n",
       "             'cloud': 31,\n",
       "             'made': 1008,\n",
       "             'precisely': 25,\n",
       "             'construction': 26,\n",
       "             'sentence': 27,\n",
       "             'frenchman': 103,\n",
       "             'russian': 462,\n",
       "             'uncourteous': 2,\n",
       "             'verbs': 2,\n",
       "             'only': 1874,\n",
       "             'remains': 74,\n",
       "             'therefore': 187,\n",
       "             'discover': 29,\n",
       "             'wanted': 214,\n",
       "             'writes': 21,\n",
       "             'prefers': 3,\n",
       "             'wearing': 88,\n",
       "             'showing': 105,\n",
       "             'face': 1126,\n",
       "             'comes': 92,\n",
       "             'mistaken': 60,\n",
       "             'resolve': 15,\n",
       "             'doubts': 40,\n",
       "             'sharp': 84,\n",
       "             'sound': 220,\n",
       "             'horses': 263,\n",
       "             'hoofs': 25,\n",
       "             'grating': 11,\n",
       "             'wheels': 48,\n",
       "             'curb': 5,\n",
       "             'followed': 330,\n",
       "             'pull': 24,\n",
       "             'whistled': 14,\n",
       "             'pair': 41,\n",
       "             'yes': 689,\n",
       "             'continued': 292,\n",
       "             'glancing': 99,\n",
       "             'window': 187,\n",
       "             'nice': 54,\n",
       "             'brougham': 5,\n",
       "             'beauties': 3,\n",
       "             'hundred': 230,\n",
       "             'fifty': 95,\n",
       "             'guineas': 4,\n",
       "             'apiece': 8,\n",
       "             'money': 327,\n",
       "             'nothing': 647,\n",
       "             'else': 202,\n",
       "             'better': 267,\n",
       "             'bit': 64,\n",
       "             'doctor': 184,\n",
       "             'stay': 75,\n",
       "             'lost': 225,\n",
       "             'boswell': 2,\n",
       "             'promises': 16,\n",
       "             'interesting': 72,\n",
       "             'pity': 76,\n",
       "             'miss': 113,\n",
       "             'client': 34,\n",
       "             'want': 324,\n",
       "             'sit': 90,\n",
       "             'best': 269,\n",
       "             'slow': 66,\n",
       "             'step': 140,\n",
       "             'stairs': 32,\n",
       "             'passage': 111,\n",
       "             'paused': 80,\n",
       "             'immediately': 183,\n",
       "             'outside': 111,\n",
       "             'loud': 65,\n",
       "             'authoritative': 3,\n",
       "             'tap': 11,\n",
       "             'come': 935,\n",
       "             'entered': 283,\n",
       "             'less': 368,\n",
       "             'feet': 180,\n",
       "             'inches': 17,\n",
       "             'height': 37,\n",
       "             'limbs': 68,\n",
       "             'hercules': 5,\n",
       "             'dress': 139,\n",
       "             'rich': 93,\n",
       "             'richness': 3,\n",
       "             'england': 312,\n",
       "             'bad': 156,\n",
       "             'taste': 24,\n",
       "             'bands': 28,\n",
       "             'astrakhan': 2,\n",
       "             'slashed': 4,\n",
       "             'sleeves': 31,\n",
       "             'fronts': 2,\n",
       "             'breasted': 2,\n",
       "             'coat': 173,\n",
       "             'deep': 216,\n",
       "             'cloak': 63,\n",
       "             'thrown': 93,\n",
       "             'shoulders': 126,\n",
       "             'lined': 33,\n",
       "             'flame': 16,\n",
       "             'coloured': 22,\n",
       "             'silk': 51,\n",
       "             'secured': 49,\n",
       "             'neck': 204,\n",
       "             'brooch': 2,\n",
       "             'consisted': 39,\n",
       "             'single': 174,\n",
       "             'flaming': 9,\n",
       "             'boots': 92,\n",
       "             'extended': 76,\n",
       "             'halfway': 20,\n",
       "             'calves': 4,\n",
       "             'trimmed': 9,\n",
       "             'tops': 4,\n",
       "             'fur': 39,\n",
       "             'completed': 26,\n",
       "             'impression': 68,\n",
       "             'barbaric': 3,\n",
       "             'opulence': 4,\n",
       "             'suggested': 70,\n",
       "             'appearance': 136,\n",
       "             'carried': 283,\n",
       "             'broad': 93,\n",
       "             'brimmed': 5,\n",
       "             'hand': 835,\n",
       "             'wore': 59,\n",
       "             'upper': 131,\n",
       "             'extending': 36,\n",
       "             'past': 224,\n",
       "             'cheekbones': 5,\n",
       "             'vizard': 2,\n",
       "             'apparently': 69,\n",
       "             'raised': 213,\n",
       "             'lower': 197,\n",
       "             'appeared': 198,\n",
       "             'hanging': 43,\n",
       "             'straight': 125,\n",
       "             'chin': 31,\n",
       "             'suggestive': 12,\n",
       "             'resolution': 58,\n",
       "             'pushed': 82,\n",
       "             'length': 64,\n",
       "             'obstinacy': 8,\n",
       "             'harsh': 23,\n",
       "             'voice': 463,\n",
       "             'strongly': 42,\n",
       "             'marked': 139,\n",
       "             'accent': 19,\n",
       "             'uncertain': 31,\n",
       "             'pray': 80,\n",
       "             'seat': 171,\n",
       "             'colleague': 8,\n",
       "             'dr': 49,\n",
       "             'occasionally': 90,\n",
       "             'cases': 454,\n",
       "             'whom': 490,\n",
       "             'honour': 17,\n",
       "             'count': 749,\n",
       "             'von': 12,\n",
       "             'kramm': 3,\n",
       "             'nobleman': 12,\n",
       "             'understand': 413,\n",
       "             'discretion': 14,\n",
       "             'trust': 69,\n",
       "             'extreme': 73,\n",
       "             'prefer': 22,\n",
       "             'communicate': 16,\n",
       "             'alone': 338,\n",
       "             'rose': 244,\n",
       "             'caught': 91,\n",
       "             'wrist': 69,\n",
       "             'back': 747,\n",
       "             'chair': 136,\n",
       "             'none': 111,\n",
       "             'say': 756,\n",
       "             'anything': 380,\n",
       "             'shrugged': 36,\n",
       "             'begin': 98,\n",
       "             'binding': 19,\n",
       "             'absolute': 57,\n",
       "             'secrecy': 19,\n",
       "             'years': 572,\n",
       "             'end': 466,\n",
       "             'present': 330,\n",
       "             'weight': 71,\n",
       "             'influence': 139,\n",
       "             'european': 100,\n",
       "             'history': 440,\n",
       "             'promise': 68,\n",
       "             'excuse': 54,\n",
       "             'strange': 221,\n",
       "             'august': 71,\n",
       "             'person': 186,\n",
       "             'employs': 3,\n",
       "             'wishes': 43,\n",
       "             'agent': 26,\n",
       "             'unknown': 88,\n",
       "             'confess': 37,\n",
       "             'once': 570,\n",
       "             'called': 451,\n",
       "             'exactly': 48,\n",
       "             'aware': 53,\n",
       "             'dryly': 6,\n",
       "             'circumstances': 108,\n",
       "             'delicacy': 12,\n",
       "             'precaution': 10,\n",
       "             'taken': 439,\n",
       "             'quench': 4,\n",
       "             'grow': 75,\n",
       "             'seriously': 64,\n",
       "             'compromise': 72,\n",
       "             'families': 46,\n",
       "             'speak': 256,\n",
       "             'plainly': 40,\n",
       "             'implicates': 6,\n",
       "             'house': 662,\n",
       "             'ormstein': 3,\n",
       "             'hereditary': 15,\n",
       "             'kings': 28,\n",
       "             'murmured': 19,\n",
       "             'settling': 17,\n",
       "             'closing': 36,\n",
       "             'glanced': 177,\n",
       "             ...})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NWORDS"
   ]
  },
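  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Edit distance ###\n",
    "The edit distance between two words is the number of single-letter operations needed to turn one word into the other. The operations are insertion (add one letter to the word), deletion (remove one letter), transposition (swap two adjacent letters), and substitution (replace one letter with another)."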
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Return the set of all strings at edit distance 1 from word\n",
    "def edits1(word):\n",
    "    n = len(word)\n",
    "    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion\n",
    "               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition\n",
    "               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration\n",
    "               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion"
   ]
  },
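  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The set of strings within two edits of something is surprisingly large: 114,324 entries.\n",
    "\n",
    "Optimization: among the strings within edit distance 2, keep only the known (correctly spelled) words as candidates. This leaves just 3 words: 'smoothing', 'something', and 'soothing'."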
   ]
  },
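  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Return the set of all strings reachable from word by two edits\n",
    "# This set is unfiltered; known_edits2 (defined in the first cell) keeps only the strings found in NWORDS\n",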
    "def edits2(word):\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))"
   ]
  },
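  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Realistically, mistyping one vowel as another is more likely than mistyping a consonant (people often type hallo for hello), misspelling the first letter of a word is relatively unlikely, and so on. To keep things simple, a simpler rule is used instead: known words at edit distance 1 take priority over known words at edit distance 2, and a known word at edit distance 0 takes priority over known words at edit distance 1."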
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def known(words): return set(w for w in words if w in NWORDS)\n",
    "\n",
    "# If an earlier known(...) set is non-empty, candidates takes that set and the later alternatives are never computed (short-circuit or)\n",
    "def correct(word):\n",
    "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n",
    "    return max(candidates, key=lambda w: NWORDS[w])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
